The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

Andrea Biagioni, Francesca Lo Cicero, Alessandro Lonardo, Pier Stanislao Paolucci, Mersia Perra, Davide Rossetti, Carlo Sidore, Francesco Simula, Laura Tosoratto and Piero Vicini
INFN Roma c/o Physics Dep. "Sapienza" Univ. di Roma, p.le Aldo Moro, 2 - 00185 Roma, Italy
Email: [firstname.lastname]@roma1.infn.it

Abstract. One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge of providing an efficient on-chip interconnection network between processor tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA-style API, over a multi-dimensional direct network (see [1] for a definition of direct networks) with a (possibly) hybrid topology. It is designed as a parametric Intellectual Property Library, easily customizable to specific needs. The currently available blocks implement wormhole, deadlock-free, packet-based communications with static routing. The DNP offers configurable numbers L, N and M of ports: respectively, intra-tile I/O ports that connect elements within the same computational tile, on-chip inter-tile ports that link different tiles on the same silicon die, and off-chip inter-tile ports that link tiles belonging to different dies. Because of the fully switched architecture, the DNP may sustain up to L + N + M packet transactions at the same time.
The DNP has been integrated into the design of an MPSoC dedicated to both high performance audio/video processing and theoretical physics applications. We present the details of its architecture and show some promising results obtained on a first preliminary implementation.

I. Introduction

In the '70s, the ever-rising complexity of the numerical applications employed by the scientific community started, for the first time, to exceed the computing power offered by any single-processor computer platform. The need for higher computational capability, together with both the scaling of silicon technology and power constraints, bolstered investigation in the field of High Performance Computing (HPC). From the beginning, the main issues related to HPC were scalability and robustness, in particular for off-chip and off-board communication.

A similar scenario is now unfolding in the embedded world. The semiconductor industry, due to limitations similar to those the HPC community faced in the past, is now shifting its interest from a monolithic processor design approach to Multi Processor System on Chip (MPSoC) [2], [3]. In this case, the challenge is to provide an efficient communication infrastructure when traditional backbone bus systems fail to guarantee full scalability and enough bandwidth between several tens of on-chip processing units (micro-controllers, DSPs, accelerators, etc.). Nowadays we are observing a convergence between embedded and HPC systems, as the single HPC computational node becomes more and more often an MPSoC.

In this scenario we propose a novel embeddable and fully synthesizable hardware block, the Distributed Network Processor (DNP), that provides a fully scalable and configurable network infrastructure (the DNP-Net) for both on-chip and off-chip interconnection, suitable for platforms scaling from a single MPSoC to a massive HPC system.
In the following, the elementary processing units connected by the DNP are called tiles, and multiple potentially heterogeneous tiles can be laid out on a single chip. A single tile may be very simple (e.g. a single µP plus the DNP) or quite complex (e.g. a µP and one or more DSPs plus the DNP), but every tile contains a DNP unit. The hierarchy may be further deepened: multiple multi-tile chips may be assembled on a processing board, and multiple processing boards plugged into a rack and wired together to build a high-performance HPC parallel system. The DNP is connected to the other devices inside the tile via a set of so-called intra-tile interfaces, while inter-tile interfaces are employed to connect a tile to the other ones, which may (on-chip) or may not (off-chip) reside on the same chip.

A distinguishing feature of our design is that the same set of RDMA-like communication primitives can be uniformly employed to address both on-chip and off-chip tiles. Traditionally, RDMA-capable interconnects (e.g. Infiniband, Myrinet) are quite expensive in terms of hardware resources, needing hundreds of megabytes of memory and high-speed offloading processors to properly handle the complexities of the layered protocol stack, or just the virtual memory translations of standard operating systems (Linux, Windows) running on commodity processors (x86, PowerPC, etc.). On the one hand, typical HPC parallel numerical applications need full control of the computing devices, badly tolerating multi-user environments while being really allergic to swapping/paging regimes, i.e. they use no more than 99% of physical system memory. On the other hand, the data processing and real-time part of embedded MPSoC applications needs locked physical buffers used as targets of DMA transfers to/from data acquisition devices, while the virtual memory support is really used only for the traditional part of the application, i.e.
the user interface, configuration logic, etc. That is why, most of the time, the memory address translation features of commercial processors (e.g. virtual-to-physical mapping units or MMUs), even if available, can be turned off, as is the case on most embedded operating systems (vxWorks, uCLinux) or even on IBM BlueGene/L [4].

With this in mind, the DNP has been optimized for the case without memory address translation, making it comparatively small and frugal in silicon resources, and opening the possibility of putting multiple DNP instances on the same chip. As a consequence, the RDMA primitives can be promoted from a low-level API on top of which the Message Passing Interface (MPI) is implemented, as is the case on clusters with Infiniband, to a full-fledged system-wide communication API, uniformly targeting both on-chip and off-chip devices. In other words, the same RDMA API can be used throughout the full hierarchy of devices.

The paper skips over all DNP software related topics (communication libraries, simulators, etc.), which can be found elsewhere [5], [6]. It is organized into four sections. In the first one we give an overview of the DNP architecture, describing its modular structure and the available interfaces. In the second section we report on the use of the DNP in the framework of the SHAPES project, which was actually the motivation for its development in the first place. In the next one we collect some results of our work, and finally we write down our conclusions.

II. DNP Architecture

The DNP [7] acts as an off-loading network engine for the tile, performing both on-chip and off-chip transfers as well as intra-tile data moving. At the inter-tile level it operates as a network adapter, while at the intra-tile level it can be used as a basic DMA controller useful for moving large data chunks, i.e. between external DRAM and internal static RAM.
The DNP has been developed as a parametric Intellectual Property library, meaning that some fundamental architectural features can be chosen among the available ones, others can be customized at design-time, and new ones can be developed. A highly modular design has been employed, separating the architectural core (DNP CORE) from the interface blocks (INTRAT IF and INTERT IF), as shown in Fig. 1. The Core and the Interfaces are connected by a customizable number of ports.

Fig. 1. A block diagram of the DNP architecture: the modular design separates the architectural core (DNP CORE) from the interface blocks (INTRAT IF and INTERT IF). The inter-tile interfaces can be either off-chip or on-chip, and in either case they can be equipped with virtual channels. The intra-tile slave interface is basically used by the software to configure and program the DNP. The intra-tile master interfaces are controlled by the DNP and are used to stream data in and out.

This flexibility allows tailoring the DNP architecture to the particular use-case and choosing the topology of the DNP-Net: it is only necessary to define the number M of inter-tile off-chip ports, the number N of inter-tile on-chip ports and the routing algorithm to automatically achieve the desired configuration (Fig. 2). For example, in the SHAPES project [8], as we will show in section III, our IP library has been used to implement a 3D torus network topology in a multi-tile system architecture. Moreover, a suitable number L of intra-tile master ports can be chosen to match the required network I/O bandwidth.
The DNP architecture is a crossbar switch with configurable routing capabilities, operating on packets with variable-sized payload. The implementation of virtual channels [9] on incoming switch ports guarantees deadlock avoidance. The DNP gathers data coming from the intra-tile ports, fragmenting the data stream into packets (Fig. 4), which are forwarded to the destination via the intra-tile or inter-tile ports depending on the requested operation.

Fig. 2. Examples of on-chip and off-chip network topologies and services offered by reconfiguring the parametric DNP.

Fig. 3. Data transfer example: an RDMA GET operation characterized by different initiator (INIT DNP), source (SRC DNP) and destination (DST DNP). The most common use is with INIT == DST.

Packet layout (fields of the packet figure):
NET HDR (2 words): PKT LENGTH, VCHAN, SRC DNP, DST DNP
RDMA HDR (3 words): COMMAND (PUT | GET | LOOPBACK), CMD LENGTH, SRC MEMORY-ADDR, DST MEMORY-ADDR
PAYLOAD (max 256 words)
FOOTER (1 word): HOPs number, ERROR flags, CRC

A. RDMA architecture and communication robustness

The DNP RDMA features may be condensed into three concepts:
• A Command Queue, implemented as a hardware FIFO queue (CMD FIFO), where the software pushes RDMA commands to be asynchronously executed by the DNP. A DNP command is composed of seven words containing the information necessary to perform the required data transport operation.
• A Completion Queue (CQ), which lives in the tile memory and is treated as a ring buffer, where the DNP writes events, which are simple data structures, and the software reads them. Events are generated as commands are executed and incoming packets are processed.
• A Look-up Table (LUT), a hardware memory block embedded in the DNP which is accessible by software through an intra-tile interface.
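The Completion Queue in the list above behaves like an ordinary producer/consumer ring buffer. The following minimal Python sketch models that behaviour in software only. The class name, the event fields and the choice to reject writes when the queue is full are assumptions made for illustration, since the text does not specify the event layout or the overflow policy.

```python
from typing import Optional


class CompletionQueue:
    """Fixed-size ring buffer: the producer (the DNP) writes events,
    the consumer (software) reads them in the same order."""

    def __init__(self, size: int) -> None:
        self.slots = [None] * size
        self.head = 0    # next slot the producer writes
        self.tail = 0    # next slot the consumer reads
        self.count = 0   # number of unread events

    def write_event(self, event: dict) -> bool:
        if self.count == len(self.slots):
            return False  # queue full (overflow policy is an assumption)
        self.slots[self.head] = event
        self.head = (self.head + 1) % len(self.slots)  # wrap around
        self.count += 1
        return True

    def read_event(self) -> Optional[dict]:
        if self.count == 0:
            return None   # nothing to read
        event = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around
        self.count -= 1
        return event


# Hypothetical event contents, for illustration only.
cq = CompletionQueue(size=2)
ok1 = cq.write_event({"cmd": "PUT", "words": 16})
ok2 = cq.write_event({"cmd": "GET", "words": 4})
ok3 = cq.write_event({"cmd": "SEND", "words": 1})  # rejected: queue is full
first = cq.read_event()
```

Reading an event frees a slot, so the producer can keep writing as long as the software drains the queue quickly enough.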
The supported RDMA commands are described by a compact set of parameters: the command code (LOOPBACK, PUT, SEND and GET), the source memory address and DNP (SRC DNP), the destination memory address and DNP (DST DNP), and the length in words.

The LOOPBACK command describes a memory move operation in which the content of a (local) memory area is copied into another one. One intra-tile interface is used to fetch data and another one to write it to the final destination. The RDMA PUT and SEND commands generate a one-way network transaction; the DNP executes them by reading the local buffer and producing a packet stream targeted to the destination DNP. The RDMA GET command is a two-way transaction in which the initiator DNP (INIT DNP) generates a request packet destined to the source DNP, which in turn generates a data packet stream toward the destination DNP (see Fig. 3 for the most general, three-actor situation). Most of the time the destination DNP is the same as the initiator, so it collapses to the usual RDMA GET operation as found in other networks. After execution of each command, the DNP optionally writes an event in the CQ, so that the software may acknowledge the command as executed, freeing the source data buffer for further use.

Fig. 4. DNP packet overview: the packet is made of a fixed-size header and footer and a variable number of payload words. The payload can be up to 256 words. There is an optional space for an integrity check code (CRC), and corrupted packets are flagged by a single bit in the footer.

The buffers which are used as a source in an RDMA command need no special care, while those which are a destination have to be pre-registered into the LUT by the software. The LUT is organized in records, each one containing the buffer physical start address, length and some flags. When a packet is received, the LUT is scanned in search of an entry matching the packet destination buffer; only in this case is the operation carried out.
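The LUT scan on packet reception can be pictured with a small software model. This is only a sketch under stated assumptions: the record fields follow the text (start address, length, flags), but the matching rule used here (the incoming write must fall entirely inside a registered buffer), the byte addressing with 32-bit words, and the names `LutEntry` and `lut_match` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

WORD_BYTES = 4  # assumption: 32-bit words on a byte-addressed memory


@dataclass(frozen=True)
class LutEntry:
    start: int          # physical start address of the registered buffer
    length_words: int   # buffer length in words
    flags: int = 0      # flag bits (their meaning is not given in the text)


def lut_match(lut: Sequence[LutEntry], dst_addr: int,
              n_words: int) -> Optional[LutEntry]:
    """Return the first entry that fully contains the incoming write,
    or None if no registered buffer matches."""
    end = dst_addr + n_words * WORD_BYTES
    for entry in lut:
        entry_end = entry.start + entry.length_words * WORD_BYTES
        if entry.start <= dst_addr and end <= entry_end:
            return entry
    return None


lut = [LutEntry(start=0x1000, length_words=256),
       LutEntry(start=0x8000, length_words=64)]
hit = lut_match(lut, dst_addr=0x1010, n_words=16)   # inside the first buffer
miss = lut_match(lut, dst_addr=0x8000, n_words=65)  # one word too long
```

In this model a packet whose destination does not match any record is simply rejected; what the hardware does with such a packet is not covered here.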
After writing (PUT and SEND) or reading (GET) the data, a corresponding event is written in the CQ, so that the software may carry on further operations, e.g. deregistering the buffer, sending back some acknowledge packets, etc.

SEND packets are similar to PUT ones with a null destination address, so that the first suitable buffer in the LUT is picked up and used as the target buffer. They are vital for the software application to bootstrap the RDMA protocol at the beginning (e.g. after buffer allocation, to communicate the buffer addresses to the other DNPs), in the eager communication protocol, and more generally to exchange small amounts of data, e.g. protocol packets. On the other hand, PUT packets are mostly intended for the rendezvous protocol.

B. DNP packets

A data sending operation may generate one or more outgoing packets. The DNP hosts a hardware fragmenter block which automatically cuts a stream of data words into a stream of packets. A packet is made up of a fixed-size envelope plus a variable-sized payload of words (Fig. 4). In the current implementation, the header is split into a network header (NET HDR), which basically carries routing information, and an RDMA header (RDMA HDR), which is processed only by the destination DNP; the footer hosts the packet integrity code and the corruption flag.

Every DNP is uniquely addressed by an 18-bit string, whose interpretation depends on the exact details of the network topology; address decoding is done in the router module and must be customized accordingly. For instance, in a 3D torus network those bits can be evenly split into an (x, y, z) triplet, while on a NoC-based design there could be an additional internal coordinate, i.e. a 4-tuple like (x, y, z, w).

C. Reliability guarantees

The DNP architecture relies on some basic assumptions:
• Neither the destination DNP nor the in-between transient DNPs are allowed to drop packets.
• Each packet is always reliably transferred to its destination.
• Packet corruption is only allowed in the payload and has to be detected and marked in the footer, so that it can be handled by the application.
• There is no low-level protocol to signal packet delivery, e.g. by ACK/NACK packets. If necessary, it can be implemented by the application.

These assumptions turn into requirements for the hardware architecture: it has to be robust enough to protect at least the packet envelope (header and footer), e.g. to avoid bad routing due to corrupted headers. The on-chip bit error rate (BER) is assumed negligible, at least at the single tile level with mature silicon processes. Most of the current on-chip inter-tile communication technologies (NoC) claim to control the BER on their own. So the inter-tile off-chip interfaces are the ones most affected by the above requirements: they have to employ suitable flow-control techniques to avoid packet dropping, and some form of redundancy has to be employed to protect the packet envelope from corruption, e.g. using checking and retransmission techniques.

D. Core

The modular approach used to design the DNP architecture is also applied to the DNP core (Fig. 1), so its main functionality is split into several separate modules. Devices inside the tile request DNP primitives by issuing commands through the intra-tile slave interface (INTRA-TILE SLAVE). The Engine (ENG) fetches commands from the CMD FIFO and uses them to fill out the packet header. The payload data are read by an intra-tile transaction using information in the RDMA Controller block (RDMA Ctrl), and the newly created packets are forwarded through the Switch port towards the proper inter-tile ports, PORT_VCH or PORT.

The routing logic (RTR) configures the SWITCH paths between the DNP ports, sustaining up to L+M+N simultaneous packet transactions. If more than one packet requires the same port, the arbiter block (ARB) applies the arbitration policy to resolve the contention.
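As a rough illustration of what the router module has to do for a 3D torus, the sketch below decodes an 18-bit DNP address into an (x, y, z) triplet and picks the outgoing port one coordinate at a time, which is one common way of implementing static routing. The bit layout (x in the low bits), the port names, the shortest-way-around rule and the default Z, Y, X evaluation order are all assumptions for illustration, not the DNP's actual logic.

```python
from typing import Sequence, Tuple

COORD_BITS = 6                 # 18 address bits split evenly over x, y, z
MASK = (1 << COORD_BITS) - 1
AXIS_NAMES = "XYZ"


def encode(x: int, y: int, z: int) -> int:
    """Pack a triplet into an 18-bit address (assumed layout: x lowest)."""
    return x | (y << COORD_BITS) | (z << (2 * COORD_BITS))


def decode(addr: int) -> Tuple[int, int, int]:
    """Unpack an 18-bit address into (x, y, z)."""
    return (addr & MASK,
            (addr >> COORD_BITS) & MASK,
            (addr >> (2 * COORD_BITS)) & MASK)


def next_port(cur: Sequence[int], dst: Sequence[int], dims: Sequence[int],
              order: Sequence[int] = (2, 1, 0)) -> str:
    """Dimension-order routing on a torus: correct one coordinate at a
    time, in the given axis order, going the shorter way around the ring."""
    for axis in order:
        size = dims[axis]
        delta = (dst[axis] - cur[axis]) % size   # hops in the + direction
        if delta == 0:
            continue                             # coordinate already matches
        sign = "+" if delta <= size - delta else "-"  # ties go to +
        return AXIS_NAMES[axis] + sign
    return "LOCAL"                               # packet has arrived


dims = (4, 4, 4)
src = decode(encode(0, 0, 0))
dst = decode(encode(3, 0, 1))
hop = next_port(src, dst, dims)   # Z is consumed first with this order
```

Because every node applies the same fixed rule, the path between two tiles is fully determined by their addresses and the chosen axis order.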
The parametrized number L of intra-tile master ports (INTRA-TILE MST) guarantees the connection among elements within the same tile. On receiving a packet, an intra-tile transaction is carried out with information from the RDMA Ctrl block, which wraps the LUT inside. Each RDMA transaction is followed by a completion operation (see sect. II-A).

Fig. 5. The SHAPES RISC+DSP+DNP Tile (RDT) is an example of integration of the DNP in a commercial environment (Atmel Diopsis 940) to provide networking capabilities. In this drawing, the M and S boxes are respectively master and slave interfaces. DPM and DDM are DSP Program and Data memories. POT is the set of peripherals. DXM is the external data memory controller.

Besides the CMD FIFO, both a set of registers (REG) and the RDMA Look-Up Table (LUT) are accessible through the intra-tile slave port. The registers are used to expose status information and to configure the DNP functionality; for instance, hand-shake protocols among blocks are often time-out based with exception raising, so that time-out thresholds, as well as the choice of arbitration logic and the port priority scheme, are configurable this way. Moreover, some registers allow software to reset, enable and disable blocks inside the DNP at run time.

E. Intra-tile and inter-tile Interfaces

As previously described, modularity characterizes the DNP design. The DNP may be adapted to different environments with minor rewriting efforts, since the interfaces are completely independent from the core. Different network topologies lead to different protocol choices as well as different error checking and correction requirements. Keeping some of these features out of the Core module makes it smaller and cleaner, moving the potential complexity to the inter-tile or intra-tile interfaces.
At the intra-tile level the DNP exposes a proprietary bus protocol, while at the inter-tile ports a FIFO-like signaling is used.

The intra-tile interfaces are in charge of translating the DNP transactions into the particular protocol used inside the tile. These interfaces are specific to the different bus architectures, e.g. AMBA-AHB, AMBA-AXI, PCI, PowerPC, etc. In our library we provide AMBA-AHB adaptors for intra-tile communication, for both the master (MASTER) and the slave (SLAVE) interfaces. On the other hand, PCI/PCI-X/PCI-e commercial cores can be easily used by developing customized wrapper blocks. The implemented INTRA-TILE SLAVE adapter maps the registers, the LUT and the command queue.

Fig. 6. 8-RDTs SHAPES chip arranged in a 3D lattice. In each chip the RDT tiles are connected by the ST-Spidergon NoC.

The inter-tile interfaces are responsible for on-chip and off-chip interconnection. The DNP can be connected to a highly specialized on-chip network infrastructure (e.g. a NoC), but it also allows for point-to-point interconnection with simple interface blocks.

A very wide range of solutions may be applied for off-chip communication, depending on the requirements. In our case study (see sect. III), the on-chip communication is NoC based, while for off-chip we provide a bidirectional Serializer/Deserializer (Ser/Des) with error check, DC-balance and re-transmission capability. The NoC used is the ST-Spidergon, while the Ser/Des block is a proprietary solution.

III. A Case Study: the SHAPES Computing Architecture

The DNP was developed in the context of the SHAPES European project, which proposes a scalable HW/SW design style for future CMOS technologies.
Besides the DNP, a typical SHAPES tile contains a VLIW floating-point DSP (Atmel mAgicV) with its on-chip Distributed Data/Program Memory (DDM-DPM), a RISC (ARM9), an interface to the Distributed External Memory (DXM) and a set of peripherals called the Peripherals on Tile (POT), e.g. JTAG, Ethernet, etc.; in SHAPES jargon, this tile configuration is called RISC+DSP+DNP Tile (RDT) (Fig. 5). The SHAPES network connects on-chip and off-chip tiles, weaving a distributed packet-switching network.

The SHAPES software tool-chain provides a simple and efficient programming environment for tiled architectures. SHAPES proposes a layered system software [8], which dramatically helps the programmer to harness the intrinsic complexity of fully scalable multi-processor systems.

A. The DNP in the SHAPES architecture

In the SHAPES framework the intra-tile interfaces are standard AMBA-AHB interfaces, connected to the DSP and the RISC through an AMBA-AHB multilayer bus (Fig. 5). For on-chip communications, the SHAPES architecture relies on the ST-Spidergon Network-on-Chip [10], [11], [12], to which the DNP is connected by a specific interface, called the DNP Network-on-Chip Interface (DNI). The 3D torus topology has been adopted for off-chip networking (Fig. 6), with bidirectional links connecting all nodes, which requires a total of six inter-tile interfaces per DNP. This port configuration results in a SHAPES instance of the DNP with L = 2, M = 6 and N = 1.

The DNP applies a deterministic routing policy to implement communications on the 3D torus network. The coordinate evaluation order (e.g. first Z is consumed, then Y and eventually X) can be chosen at run-time by writing into a specialized priority register; in this way, the routing scheme is configurable to a certain extent.

1) The On-Chip Interface: the DNI is the on-chip bidirectional interface handling DNP transmissions to/from the ST-Spidergon NoC.
The communication protocol employed is a hand-shake protocol based on a request/grant policy. This interface includes a sub-module that verifies data by means of a Cyclic Redundancy Check (CRC). During the packet delivery process a CRC is computed and transmitted together with the footer. On reception, the CRC is recalculated and checked; in case of transmission errors a bit in the footer is set and the packet goes on its way. The software can detect packet corruption by analysing the footer. The ST-Spidergon NoC implements deadlock avoidance on its own, therefore no virtual channels are necessary on the DNP port side.

2) The Off-Chip Interface: the inter-tile off-chip interface has a parallel clock SerDes architecture, employing Double Data Rate signaling in order to reduce packet transmission latency. Special encoding and a DC-balance block guarantee the quality of the transmission line. The balancing is performed by inverting the transmitted word to equalize the number of 1 and 0 bits over time. Furthermore, the inter-tile off-chip interface implements the mesochronous clocking technique in order to handle the clock-phase skew between communicating DNPs.

The interface manages the data flow by encapsulating the DNP packets into a light, low-level protocol able to detect transmission errors via CRC, and includes a memory buffer to re-transmit the header and the footer in case of transmission errors. Therefore the protocol assures the delivery of the packet, avoiding non-recoverable situations where badly corrupted packets (with errors in the header or footer) pose a threat to the global routing. In case of packets with payload errors (signaled by the footer), the software communication library is in charge of handling the situation. The chosen CRC polynomial generator is the same used for the inter-tile on-chip interface: it is the industry-standard, well-known CRC-16.
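The check performed at the receiving side can be sketched in a few lines. The text only says that the standard CRC-16 is used, so the exact variant is an assumption here: the polynomial x^16 + x^15 + x^2 + 1, processed bit-reflected with a zero initial value (the variant often called CRC-16/ARC), applied byte by byte. The helper name `footer_error_flag` is hypothetical.

```python
def crc16(data: bytes, crc: int = 0x0000) -> int:
    """Bitwise CRC-16 with the reflected polynomial 0xA001
    (x^16 + x^15 + x^2 + 1), zero initial value, no final XOR."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def footer_error_flag(payload: bytes, received_crc: int) -> bool:
    """Recompute the CRC at the receiver; True means the error bit
    in the footer would be set and the packet forwarded anyway."""
    return crc16(payload) != received_crc


payload = bytes(range(16))
sent_crc = crc16(payload)                             # computed by the sender
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]  # single bit flipped

clean_flag = footer_error_flag(payload, sent_crc)
error_flag = footer_error_flag(corrupted, sent_crc)
```

A hardware implementation would typically compute the same remainder with a shift register, one or more bits per clock, rather than looping over bytes.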
Past experience with a custom-developed massively parallel computing architecture [13], [14] has given us insight into the bit error rate of a typical off-chip LVDS link, and has shown CRC-16 to be a satisfactory solution.

B. Exploring the DNP high configurability

Within the SHAPES project, thanks to the high level of parametrization offered by the DNP, we were able to propose different solutions for the inter-tile on-chip network. Besides the ST-Spidergon NoC (MTNoC), an alternative solution has been proposed that involves an on-chip 2D mesh network, obtained using the DNP inter-tile ports connected point-to-point (MT2D). This experiment provided two solutions suitable for possibly different application requirements (Fig. 7).

Fig. 7. Details of the explored solutions: MTNoC and MT2D. (a) On-chip communication ST-NoC based (MTNoC). (b) On-chip communication 2D-DNP based (MT2D).

IV. Results

The DNP architecture has been validated in the SHAPES project by using the DNP TLM SystemC [15] model integrated in different multi-tile processor configurations. In particular, the DNP was employed in benchmarking the SHAPES architecture on a kernel code [16] for Lattice Quantum Chromo Dynamics (LQCD), and tested on a system configuration of 8 RDTs arranged in a 2 x 2 x 2 3D topology (one of those depicted in Fig. 7).

At the same time, an RTL model (written in VHDL) has been used to simulate and validate the DNP implementation, with both ad-hoc stimuli test vectors and mixed SystemC-VHDL simulations. Along with preliminary measurements of intra-tile and inter-tile bandwidth and latency, we estimate area consumption and power dissipation on an early place-and-route layout. In the following, bandwidth figures are meant to be unidirectional, i.e. for each direction, unless explicitly specified.

Depending on the particular intra-tile interface adopted in the destination design, e.g.
matching a bus with reduced width or with lower frequency, the DNP intra-tile port is able to sustain up to 1 word/cycle (1 word equals 32 bits), so the resulting intra-tile bandwidth is $BW_{int} = L \times 32 = 64$ bit/cycle, roughly 4 GB/s at 500 MHz, or 4 + 4 GB/s bidirectional bandwidth. The latency for intra-tile communications, i.e. the time interval elapsed from the LOOPBACK command reaching the CMD FIFO to the first outgoing data surfacing onto the bus, as in Fig. 8, is $L_{int} = L_1 + L_2 \approx 100$ cycles, equal to 200 ns at the target frequency.

Fig. 8. Details on the timing measurement solution adopted for a LOOPBACK operation. $L_1$ represents the time elapsed between the command issuing and the beginning of the read intra-tile transaction, while $L_2$ is the time necessary to complete the LOOPBACK operation and to begin the write intra-tile operation.

Inter-tile on-chip ports are designed to be connected by point-to-point parallel links and support a bandwidth of 1 word/cycle as well. Because of the DNP distributed network architecture, it is possible to arrange tiles inside the chip so as to minimize the wire lengths and the power dissipated by data transmission. In this context a parallel on-chip link is still a valid choice considering the trade-off between power dissipation and bandwidth, as the design of many NoCs demonstrates.

While in theory inter-tile off-chip ports may also sustain up to 1 word/cycle network load, the actual off-chip channel bandwidth is inherently dependent on the serialization factor adopted on the link, defined as the ratio between the DNP internal data width and the number of serial lines.

Fig. 9. Single-hop PUT command with 1 word payload. $L_1$ represents the time elapsed between the command issuing and the beginning of the read intra-tile transaction; $L_2$ is the time necessary to transmit the first header word of the packet to the appropriate inter-tile interface across the switch; $L_3$ is the transmission time over the serialized off-chip interface toward the DST TILE; $L_4$ is the time down to the intra-tile write operation.

Fig. 10. The broken-out timings for the PUT command in Fig. 9 are graphed here (horizontal axis: cycles, 50 to 350).

Fig. 11. Double-hop timings for a PUT command (horizontal axis: cycles, 50 to 350). The additional segment is the time elapsed for the hop across an additional DNP, containing the transmission time over the second serialized off-chip interface; it partially overlaps the $L_3$ for the first hop.

TABLE I. Place&Route trials for both MTNoC and MT2D DNP at 500 MHz, 45nm VLSI technology. Note that in both designs not all ports are used even though they are present and accounted for. MTNoC DNP area does not account for the additional on-chip NoC component.

                       MTNoC DNP    MT2D DNP
  on-chip ports (N)    1            3
  off-chip ports (M)   1            1
  estimated area       1.30 mm^2    1.76 mm^2
  estimated power      160 mW       180 mW

In the SHAPES implementation we chose a serialization factor equal to 16, in order to minimize the number of double data rate physical wires (and the corresponding multi-tile processor chip pin count), to allow for high transmission frequencies and distances of the order of some meters, and to simplify the design of the processing board PCB. The above choice leads to an off-chip network bandwidth equal to 4 bit/cycle. Taking into account all the factors, the inter-tile on-chip and off-chip bandwidths are $BW_{on\text{-}chip} = N \times 32$ bit/cycle and $BW_{off\text{-}chip} = M \times 4$ bit/cycle per direction.
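To make the bandwidth formulas concrete, the short calculation below converts bits per cycle into bytes per second at the 500 MHz target frequency, using the SHAPES port counts L = 2, M = 6 and N = 1 quoted in the case study. The function and variable names are only illustrative; the input figures are the ones given in the text.

```python
FREQ_HZ = 500_000_000          # target clock frequency from the text
WORD_BITS = 32                 # 1 word equals 32 bits
OFF_CHIP_BITS_PER_CYCLE = 4    # per port, with serialization factor 16


def bytes_per_second(bits_per_cycle: int, freq_hz: int = FREQ_HZ) -> int:
    """Convert a per-cycle bit rate into bytes per second."""
    return bits_per_cycle * freq_hz // 8


L, M, N = 2, 6, 1              # SHAPES DNP port configuration

bw_intra = bytes_per_second(L * WORD_BITS)                   # 4 GB/s
bw_on_chip = bytes_per_second(N * WORD_BITS)                 # 2 GB/s
bw_off_chip = bytes_per_second(M * OFF_CHIP_BITS_PER_CYCLE)  # 1.5 GB/s
bw_off_chip_port = bytes_per_second(OFF_CHIP_BITS_PER_CYCLE) # 250 MB/s

for name, value in [("intra-tile", bw_intra), ("on-chip", bw_on_chip),
                    ("off-chip", bw_off_chip)]:
    print(f"{name}: {value / 1e9:.2f} GB/s per direction")
```

All figures are per direction and are aggregate peak values over the ports of one DNP; a single off-chip link carries 250 MB/s per direction under these assumptions.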
The associated single-hop latencies (Figs. 9 and 10), defined as the time interval elapsed from the writing of the RDMA command on the sender DNP CMD FIFO until the first word of the packet header is written on the destination DNP intra-tile interface, are $L_{on\text{-}chip} = L_1 + L_2 + L_4 \approx 130$ and $L_{off\text{-}chip} = L_1 + L_2 + L_3 + L_4 \approx 250$ cycles, respectively 260 ns and 500 ns at 500 MHz. The cost in latency of an additional hop (Fig. 11) over an off-chip interface, for example to reach the next-neighbor DNP, is 100 cycles, which is less than the naive guess of $L_2 + L_3 \approx 150$ cycles thanks to wormhole routing. Note that the relatively high value of $L_{off\text{-}chip}$ is influenced by the latency introduced by serialization and is dependent on the serialization factor.

We performed the Place&Route trial of both the MTNoC and MT2D inter-tile interconnection infrastructures, introduced in sec. III-B, using a 45nm silicon process at 500 MHz. The current DNP area estimates for the MTNoC and the MT2D architectures are respectively 1.30 mm^2 and 1.76 mm^2; the larger area for the latter is mainly due to the higher number of on-chip ports (3 ports vs. 1 port), implying a more complex switch matrix architecture and a larger number of DNP data buffers, a complexity which is moved into the NoC block in the MTNoC case.

Furthermore, it must be taken into account that these estimates were obtained from Place&Route trials where the DNP data buffers were synthesized using registers in place of memory macros, currently not available in the silicon process library. We expect to halve this area in the final design.

Estimated figures for the power consumption of a single DNP in the RDT computational tile are 160 mW in the MTNoC case and 180 mW in the MT2D case, with the same silicon process and frequency, amounting to about 1/4 of the tile dissipation figure.
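The latency figures quoted earlier in this section can be cross-checked with simple arithmetic. The sketch below uses only the rounded cycle counts from the text (130 and 250 cycles for a single hop, 100 cycles per additional off-chip hop, 150 cycles for the naive per-hop estimate); the derived values are therefore estimates inferred from rounded numbers, not measurements, and the variable names are illustrative.

```python
FREQ_MHZ = 500
CYCLE_NS = 1000 // FREQ_MHZ     # 2 ns per cycle at 500 MHz

l_on_chip = 130                 # L1 + L2 + L4, in cycles
l_off_chip = 250                # L1 + L2 + L3 + L4, in cycles
extra_hop = 100                 # measured cost of one more off-chip hop
naive_hop = 150                 # naive per-hop estimate from the text

# The two single-hop paths differ only by the serialized link time L3.
l3_estimate = l_off_chip - l_on_chip        # 120 cycles

# Cycles hidden per extra hop by overlapping transfers (wormhole routing).
overlap = naive_hop - extra_hop             # 50 cycles

# Expected double-hop latency for a PUT over two off-chip links.
two_hop = l_off_chip + extra_hop            # 350 cycles

print(f"L3 ~ {l3_estimate} cycles ({l3_estimate * CYCLE_NS} ns)")
print(f"overlap ~ {overlap} cycles ({overlap * CYCLE_NS} ns)")
print(f"two hops ~ {two_hop} cycles ({two_hop * CYCLE_NS} ns)")
```

Under these figures the serialized link accounts for roughly half of the single-hop off-chip latency, which is consistent with the remark that the off-chip latency depends strongly on the serialization factor.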
A motherboard built with 32 multi-tile processor chips (where one chip includes 8 RDTs) would provide 1 TeraFlops of floating-point computing power with roughly 600 W of peak power consumption.

V. Conclusion and future work

In the SHAPES project framework, we showed that the parametric and configurable architecture of the DNP is effective for the architectural exploration of on-chip and off-chip interconnection networks and rapidly yields an efficient design.

As of today, the inter-tile off-chip interfaces are implemented with a serialization factor equal to 16; there is room for considerable improvements in bandwidth and latency, either by reducing the serialization factor to 8 or by increasing the switching frequency of the off-chip physical links. Using the target 45 nm silicon process technology, we expect to double the current switching frequency, pushing it up to 1 GHz.

The routing function of the network and the set of supported commands are currently implemented by hard-coded logic blocks; the option of having a µP in their place is currently under study. This replacement would be a significant step towards a more thoroughly programmable DNP.

Moreover, in future releases of the DNP we also plan to strengthen the robustness of the network infrastructure by integrating the minimal hardware redundancy needed to support the well-known fault-tolerant routing methods for torus-based, point-to-point networks [17], [18].

Considering our know-how on massively parallel architectures, collected in more than 20 years of research and development in the field, we think that the DNP can be a key building block for assembling HPC systems that target large multi-dimensional lattice problems (e.g. LQCD) [19], for which high demands of parallel processing and data communication between processing nodes are critical.

Acknowledgment

The authors would like to thank Dr. R. Ammendola, for useful discussions and vital activities on seminal work, and Prof. N.
Cabibbo for being the benign founder of the research group in which all of us work.

References

[1] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003.
[2] W.J. Dally and S. Lacy, VLSI Architectures: Past, Present and Future, Proc. Advanced Research in VLSI Conf., IEEE Press (1999), Pages 232-241.
[3] M.B. Taylor et al., The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs, IEEE Micro, 2002.
[4] J. Moreira et al., The Blue Gene/L Supercomputer: A Hardware and Software Story, International Journal of Parallel Programming, n. 3, Pages 181-206, June 2007.
[5] K. Popovici, X. Guerin, F. Rousseau, P.S. Paolucci and A.A. Jerraya, Platform-based software design flow for heterogeneous MPSoC, Transactions on Embedded Computing Systems, Vol. 7, Pages 1-23, New York, NY, USA, 2008.
[6] L. Thiele, I. Bacivarov, W. Haid and K. Huang, Mapping Applications to Tiled Multiprocessor Embedded Systems, Proc. 7th Intl Conference on Application of Concurrency to System Design (ACSD 2007), IEEE Computer Society, Pages 29-40, Bratislava, Slovak Republic, July 2007.
[7] P.S. Paolucci, P. Vicini et al., Introduction to the Tiled HW Architecture of SHAPES, Proc. Design, Automation and Test in Europe (DATE'07), Vol. 1, Pages 77-82, Nice, France, April 2007.
[8] P.S. Paolucci, A.A. Jerraya, R. Leupers, L. Thiele and P. Vicini, Shapes: a tiled scalable software hardware architecture platform for embedded systems, Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis CODES+ISSS '06 (Seoul, Korea), ACM Press, Pages 167-172, 2006.
[9] W.J. Dally and C.L. Seitz, Deadlock-Free Message Routing in Multiprocessor Interconnection Networks, IEEE Trans. Comput. 36 (1987) 547-553.
[10] M. Coppola, M.D. Grammatikakis, R. Locatelli, G. Maruccia and L. Pieralisi, Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC, CRC Press, Taylor & Francis Group, 2008.
[11] F. Palumbo, S. Secchi, D. Pani and L. Raffo, A Novel Non-Exclusive Dual-Mode Architecture for MPSoCs-Oriented Network on Chip Designs, Proceedings of the International Workshop on Systems, Architectures, Modeling, and Simulation, Samos, Greece, July 21-24, 2008, Pages 96-105, LNCS 5114.
[12] F. Vitullo, N.E. L'Insalata, E. Petri, L. Fanucci, M. Casula, R. Locatelli and M. Coppola, Low-Complexity Link Microarchitecture for Mesochronous Communication in Networks on Chip, IEEE Transactions on Computers, Vol. 57, No. 9, September 2008.
[13] F. Bodin et al., The apeNEXT project, Nuclear Physics B - Proceedings Supplements, Volumes 106-107, March 2002, Pages 173-176, ISSN 0920-5632.
[14] R. Ammendola et al., APENet: LQCD clusters a la APE, Nuclear Physics B (Proc. Suppl.) 140 (2005) 826-828.
[15] K. Huang, I. Bacivarov, F. Hugelshofer and L. Thiele, Scalably distributed SystemC simulation for embedded applications, International Symposium on Industrial Embedded Systems (SIES 2008), Pages 271-274, June 2008.
[16] M. Lüscher, Lattice QCD: From quark confinement to asymptotic freedom, Annales Henri Poincaré 4:S197-S210, 2003.
[17] R.V. Boppana and S. Chalasani, Fault-Tolerant Wormhole Routing Algorithms for Mesh Networks, IEEE Transactions on Computers, Vol. 44 n. 7, Pages 848-864, July 1995.
[18] R.V. Boppana and S. Chalasani, Fault-Tolerant Communication with Partitioned Dimension-Order Routers, IEEE Transactions on Parallel and Distributed Systems, Vol. 10 n. 10, Pages 1026-1039, October 1999.
[19] R. Ammendola, A. Biagioni, S. De Luca, F. Lo Cicero, A. Lonardo, P.S. Paolucci, M. Perra, D. Rossetti, C. Sidore, F. Simula, N. Tantalo, L. Tosoratto and P. Vicini, Computing for Lattice QCD: New developments from the APE experiment, Il Nuovo Cimento B, Vol. 123 n. 6-7, Pages 964-968 (2008).